
    Generalized Shortest Path Kernel on Graphs

    We consider the problem of classifying graphs using graph kernels. We define a new graph kernel, called the generalized shortest path kernel, based on the number and length of shortest paths between nodes. For our example classification problem, we consider the task of classifying random graphs from two well-known families by the number of clusters they contain. We verify empirically that the generalized shortest path kernel outperforms the original shortest path kernel on a number of datasets. We give a theoretical analysis explaining our experimental results. In particular, we estimate distributions of the expected feature vectors for the shortest path kernel and the generalized shortest path kernel, and we show evidence explaining why our graph kernel outperforms the shortest path kernel on our graph classification problem. Comment: Short version presented at Discovery Science 2015 in Banff.
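    To make the idea concrete, here is a minimal, hedged sketch (not the authors' implementation) of feature maps for the shortest path kernel and a generalized variant that also counts how many shortest paths connect each node pair, as the abstract describes. The linear kernel, the path-count cap, and the random test graphs are illustrative assumptions.

```python
# Sketch of shortest-path (SP) and generalized shortest-path (GSP) kernel features.
from collections import Counter
from itertools import combinations
import networkx as nx

def sp_features(G):
    """SP kernel features: histogram of shortest-path lengths over node pairs."""
    lengths = dict(nx.all_pairs_shortest_path_length(G))
    return Counter(lengths[u][v] for u, v in combinations(G.nodes, 2) if v in lengths[u])

def gsp_features(G, max_paths=5):
    """Generalized features: histogram over (length, number of shortest paths) pairs,
    with the path count capped at max_paths (a hypothetical truncation)."""
    feats = Counter()
    for u, v in combinations(G.nodes, 2):
        try:
            paths = list(nx.all_shortest_paths(G, u, v))
        except nx.NetworkXNoPath:
            continue
        feats[(len(paths[0]) - 1, min(len(paths), max_paths))] += 1
    return feats

def kernel(f1, f2):
    """Linear kernel between two sparse feature maps."""
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())

G1, G2 = nx.erdos_renyi_graph(30, 0.1, seed=0), nx.erdos_renyi_graph(30, 0.1, seed=1)
print(kernel(gsp_features(G1), gsp_features(G2)))
```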

    A Principled Approach to Analyze Expressiveness and Accuracy of Graph Neural Networks

    Graph neural networks (GNNs) have seen increasing success recently, with many GNN variants achieving state-of-the-art results on node and graph classification tasks. The proposed GNNs, however, often implement complex node and graph embedding schemes, which makes it challenging to explain their performance. In this paper, we investigate the link between a GNN's expressiveness, that is, its ability to map different graphs to different representations, and its generalization performance in a graph classification setting. In particular, we propose a principled experimental procedure in which we (i) define a practical measure of expressiveness, (ii) introduce an expressiveness-based loss function that we use to train a simple yet practical GNN that is permutation-invariant, and (iii) illustrate our procedure on benchmark graph classification problems and on an original real-world application. Our results reveal that expressiveness alone does not guarantee better performance, and that a powerful GNN should be able to produce graph representations that are well separated with respect to the class of the corresponding graphs.
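    The following is a minimal numpy sketch of a permutation-invariant GNN of the general kind discussed above: sum-aggregation message passing followed by a sum readout, so that relabelling the nodes leaves the graph embedding unchanged. The two-layer depth, tanh nonlinearity, and layer sizes are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_embed(A, X, Ws):
    """A: (n, n) adjacency matrix, X: (n, d) node features, Ws: list of weight matrices."""
    H = X
    for W in Ws:
        H = np.tanh((A + np.eye(A.shape[0])) @ H @ W)  # aggregate neighbours plus self, then transform
    return H.sum(axis=0)  # sum readout: invariant to node permutations

n, d, h = 6, 4, 8
A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T  # random undirected graph
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, h)), rng.normal(size=(h, h))]

perm = rng.permutation(n)
z1 = gnn_embed(A, X, Ws)
z2 = gnn_embed(A[np.ix_(perm, perm)], X[perm], Ws)
print(np.allclose(z1, z2))  # True: the graph embedding does not depend on node ordering
```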

    A quantitative approach to study indirect effects among disease proteins in the human protein interaction network

    Background: Systems biology makes it possible to study larger and more intricate systems than before, so it is now possible to look at the molecular basis of several diseases in parallel. Analyzing the interaction network of proteins in the cell can be the key to understanding how complex processes lead to diseases. Novel tools in network analysis provide the possibility to quantify the key interacting proteins in large networks as well as the proteins that connect them. Here we suggest a new method to study the relationships between topology and functionality of the protein-protein interaction network, by identifying key mediator proteins that possibly maintain indirect relationships among proteins causing various diseases. Results: Based on the i2d and OMIM databases, we have constructed (i) a network of proteins causing five selected diseases (DP, disease proteins) plus their interacting partners (IP, non-disease proteins), the DPIP network, and (ii) a protein network showing only these IPs and their interactions, the IP network. The five investigated diseases were (1) various cancers, (2) heart diseases, (3) obesity, (4) diabetes and (5) autism. We have quantified the number and strength of IP-mediated indirect effects between the five groups of disease proteins and hypothetically identified the most important mediator proteins linking heart disease to obesity or diabetes in the IP network. The results present the relationship between mediator role and centrality, as well as between mediator role and functional properties of these proteins. Conclusions: We show that a protein which plays an important indirect mediator role between two diseases is not necessarily a hub in the PPI network. This may suggest that, even if hub proteins and disease proteins are trivially of great interest, mediators may also deserve more attention, especially if disease-disease associations are to be understood. Identifying the hubs may not be sufficient to understand particular pathways. We have found that the mediators between heart diseases and obesity, as well as between heart diseases and diabetes, are of relatively high functional importance in the cell. The mediator proteins suggested here should be experimentally tested as products of hypothetical disease-related proteins.
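    A hedged sketch of the kind of indirect-effect analysis described above: count how often each candidate mediator (non-disease) protein lies on a shortest path between proteins of two disease groups in a protein-protein interaction network. The toy edges and protein names are invented placeholders; the actual study was built on the i2d and OMIM databases.

```python
from collections import Counter
import networkx as nx

# Toy PPI network: HD* are heart-disease proteins, OB* obesity proteins, M* candidate mediators.
ppi = nx.Graph([("HD1", "M1"), ("M1", "OB1"), ("HD1", "M2"), ("M2", "M3"), ("M3", "OB2")])
heart, obesity = {"HD1"}, {"OB1", "OB2"}

mediator_counts = Counter()
for s in heart:
    for t in obesity:
        for path in nx.all_shortest_paths(ppi, s, t):
            # Count interior nodes of the path that are not themselves disease proteins.
            mediator_counts.update(p for p in path[1:-1] if p not in heart | obesity)

print(mediator_counts.most_common())  # proteins mediating the most heart-obesity shortest paths
```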

    The Born supremacy: quantum advantage and training of an Ising Born machine

    The search for an application of near-term quantum devices is widespread. Quantum Machine Learning is touted as a potential utilisation of such devices, particularly those which are out of the reach of the simulation capabilities of classical computers. In this work, we propose a generative Quantum Machine Learning model, called the Ising Born Machine (IBM), which we show cannot, in the worst case and up to suitable notions of error, be simulated efficiently by a classical device. We also show this holds for all the circuit families encountered during training. In particular, we explore quantum circuit learning using non-universal circuits derived from Ising Model Hamiltonians, which are implementable on near-term quantum devices. We propose two novel training methods for the IBM, utilising the Stein Discrepancy and the Sinkhorn Divergence cost functions. We show numerically, both using a simulator within Rigetti's Forest platform and on the Aspen-1 16Q chip, that the cost functions we suggest outperform the more commonly used Maximum Mean Discrepancy (MMD) for differentiable training. We also propose an improvement to the MMD via a novel use of quantum kernels, which we demonstrate provides improvements over its classical counterpart. We discuss the potential of these methods to learn 'hard' quantum distributions, a feat which would demonstrate the advantage of quantum over classical computers, and provide the first formal definitions for what we call 'Quantum Learning Supremacy'. Finally, we propose a novel view on the area of quantum circuit compilation by using the IBM to 'mimic' target quantum circuits using classical output data only. Comment: v3: close to the journal published version, with a significant restructuring into main text and appendices (see v2 for the unsplit version); v2: typos corrected, figures altered slightly; v1: 68 pages, 39 figures. Comments welcome. Implementation at https://github.com/BrianCoyle/IsingBornMachin
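    As a point of reference, here is a minimal sketch of the Maximum Mean Discrepancy (MMD) estimator mentioned above, with a Gaussian kernel, as one would use it to compare bitstring samples drawn from a Born machine against training data. The bandwidth and the random samples are illustrative assumptions; the Stein and Sinkhorn cost functions and the quantum kernel studied in the paper are not shown.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    """Plug-in (biased, V-statistic) estimate of MMD^2 between sample sets X and Y (rows)."""
    kxx = np.mean([gaussian_kernel(a, b, sigma) for a in X for b in X])
    kyy = np.mean([gaussian_kernel(a, b, sigma) for a in Y for b in Y])
    kxy = np.mean([gaussian_kernel(a, b, sigma) for a in X for b in Y])
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
model_samples = rng.integers(0, 2, size=(100, 4))   # e.g. bitstrings sampled from the model
data_samples = rng.integers(0, 2, size=(100, 4))    # bitstrings from the target distribution
print(mmd_squared(model_samples, data_samples))
```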

    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short-read RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin; this is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms; this is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms; either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression; these include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation; various tools facilitate the visualization of RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods. Comment: 31 pages, 1 figure, 9 tables. Small corrections added.
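    As a toy illustration of one quantification task in the list above (point 3), here is a sketch of estimating the relative usage ("percent spliced-in", PSI) of an exon-skipping event from read counts supporting inclusion versus exclusion. The length normalisation is a simplified assumption; real tools model junctions, read lengths and uncertainty far more carefully.

```python
def psi(inclusion_reads, exclusion_reads, inclusion_len=2, exclusion_len=1):
    """PSI estimate from junction read counts, normalised by the number of supporting junctions."""
    inc = inclusion_reads / inclusion_len   # the inclusion isoform is supported by more junctions
    exc = exclusion_reads / exclusion_len
    return inc / (inc + exc) if (inc + exc) > 0 else float("nan")

print(psi(inclusion_reads=80, exclusion_reads=20))  # about 0.67: the event is mostly included
```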

    GENN: A GEneral Neural Network for Learning Tabulated Data with Examples from Protein Structure Prediction

    We present a GEneral Neural Network (GENN) for learning trends from existing data and making predictions of unknown information. The main novelty of GENN is its generality, simplicity of use, and its specific handling of windowed input/output. Its main strength is its efficient handling of the input data, enabling learning from large datasets. GENN is built on a two-layered neural network and has the option to use separate input–output pairs or window-based data, using data structures that efficiently represent input–output pairs. The program was tested on predicting the accessible surface area of globular proteins, scoring proteins according to similarity to native, and predicting protein disorder, and has performed remarkably well. In this paper we describe the program and its use. Specifically, we give as an example the construction of a similarity-to-native protein scoring function that was built using GENN. The source code and Linux executables for GENN are available from Research and Information Systems at http://mamiris.com and from the Battelle Center for Mathematical Medicine at http://mathmed.org. Bugs and problems with the GENN program should be reported to EF.
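    Below is an illustrative sketch (not the GENN code itself, which is distributed at the URLs above) of the window-based input handling the abstract describes: a sliding window over per-residue features is flattened and fed to a small two-layer network. The window size, layer width, synthetic data, and the use of scikit-learn's MLPRegressor are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def windowed(features, window=7):
    """Stack a sliding window of per-position feature rows into one flattened input row each."""
    half = window // 2
    padded = np.pad(features, ((half, half), (0, 0)))  # zero-pad the ends of the sequence
    return np.array([padded[i:i + window].ravel() for i in range(len(features))])

rng = np.random.default_rng(0)
per_residue = rng.normal(size=(200, 20))   # e.g. 200 residues with 20 features each
targets = rng.uniform(size=200)            # e.g. relative accessible surface area per residue

X = windowed(per_residue, window=7)
model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
model.fit(X, targets)
print(model.predict(X[:3]))
```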

    Automatic prediction of catalytic residues by modeling residue structural neighborhood

    Background: Prediction of catalytic residues is a major step in characterizing the function of enzymes. In its simpler formulation, the problem can be cast as a binary classification task at the residue level: predicting whether the residue is directly involved in the catalytic process. The task is quite hard even when structural information is available, due to the rather wide range of roles a functional residue can play and to the large imbalance between the number of catalytic and non-catalytic residues. Results: We developed an effective representation of structural information by modeling spherical regions around candidate residues and extracting statistics on the properties of their content, such as physico-chemical properties, atomic density, flexibility, and the presence of water molecules. We trained an SVM classifier combining our features with sequence-based information and previously developed 3D features, and compared its performance with the most recent state-of-the-art approaches on different benchmark datasets. We further analyzed the discriminant power of the information provided by the presence of heterogens in the residue neighborhood. Conclusions: Our structure-based method achieves consistent improvements on all tested datasets over both sequence-based and structure-based state-of-the-art approaches. Structural neighborhood information is shown to be responsible for these results, and predicting the presence of nearby heterogens seems to be a promising direction for further improvement.
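    A hedged sketch of the kind of spherical-neighbourhood featurisation described above: for each candidate residue, collect the atoms within a fixed radius and summarise simple statistics (atomic density, mean hydrophobicity, nearby water count), then train an SVM with class weighting to counter the catalytic/non-catalytic imbalance. The radius, the feature choices, and the toy data are assumptions standing in for the paper's richer descriptors.

```python
import numpy as np
from sklearn.svm import SVC

def neighbourhood_features(coords, residue_xyz, hydrophobicity, is_water, radius=8.0):
    """Summary statistics of the atoms inside a sphere around one candidate residue."""
    d = np.linalg.norm(coords - residue_xyz, axis=1)
    mask = d < radius
    volume = 4.0 / 3.0 * np.pi * radius ** 3
    return [mask.sum() / volume,                                   # atomic density
            hydrophobicity[mask].mean() if mask.any() else 0.0,    # mean hydrophobicity
            is_water[mask].sum()]                                  # nearby water molecules

rng = np.random.default_rng(0)
coords = rng.normal(scale=20, size=(500, 3))        # toy atom coordinates
hydro = rng.normal(size=500)                        # toy per-atom hydrophobicity
water = rng.integers(0, 2, size=500).astype(bool)   # toy water flags

residues_xyz = rng.normal(scale=20, size=(60, 3))
X = np.array([neighbourhood_features(coords, r, hydro, water) for r in residues_xyz])
y = rng.integers(0, 2, size=60)                     # toy catalytic / non-catalytic labels

clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)  # weighting counters class imbalance
print(clf.predict(X[:5]))
```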

    Is EC class predictable from reaction mechanism?

    We thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support. Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN) classifiers. We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and an external test set are used. Results: The three descriptor sets encoding the overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall-reaction descriptors but not by mechanistic ones. Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective; kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from that of assigning an EC classification from a cheminformatics representation of a reaction.
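    The comparison set-up described above can be sketched in a few lines: the same reaction descriptors are fed to SVM, Random Forest and k-nearest-neighbour classifiers and scored by cross-validation. The random descriptor matrix and EC-class labels below are placeholders; the real study used mechanism-based and overall-reaction descriptor sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))      # placeholder reaction descriptors
y = rng.integers(1, 7, size=300)    # placeholder EC top-level classes 1-6

models = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```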